Lenses: An On-Demand Approach to ETL

Authors

  • Ying Yang
  • Niccolò Meneghetti
  • Ronny Fehling
  • Zhen Hua Liu
  • Oliver Kennedy
Abstract

Three mentalities have emerged in analytics. One view holds that reliable analytics is impossible without high-quality data, and relies on heavy-duty ETL processes and upfront data curation to provide it. The second view takes a more ad-hoc approach, collecting data into a data lake, and placing responsibility for data quality on the analyst querying it. A third, on-demand approach has emerged over the past decade in the form of numerous systems like Paygo or HLog, which allow for incremental curation of the data and help analysts to make principled trade-offs between data quality and effort. Though quite useful in isolation, these systems target only specific quality problems (e.g., Paygo targets only schema matching and entity resolution). In this paper, we explore the design of a general, extensible infrastructure for on-demand curation that is based on probabilistic query processing. We illustrate its generality through examples and show how such an infrastructure can be used to gracefully make existing ETL workflows “on-demand”. Finally, we present a user interface for On-Demand ETL and address ensuing challenges, including that of efficiently ranking potential data curation tasks. Our experimental results show that On-Demand ETL is feasible and that our greedy ranking strategy for curation tasks, called CPI, is effective.

Related resources

On-Demand ELT Architecture for Right-Time BI: Extending the Vision

In a typical BI infrastructure, data extracted from operational data sources is transformed and cleansed, and subsequently loaded into a data warehouse where it can be queried for reporting purposes. ETL, the process of extraction, transformation, and loading, is a periodic process that may involve an elaborate and rather established software ecosystem. Typically, the actual ETL process is e...

CloudETL: Scalable Dimensional ETL for Hadoop and Hive

Extract-Transform-Load (ETL) programs process data from sources into data warehouses (DWs). Due to the rapid growth of data volumes, there is an increasing demand for systems that can scale on demand. Recently, much attention has been given to MapReduce, which is a framework for highly parallel handling of massive data sets in cloud environments. The MapReduce-based Hive has been proposed as a D...

An Approach for the Estimation of Aggregate Potential Telecommuting Demand

Development of technology has made possible the invention of innovative and modern methods to solve partially the problems caused by traffic congestion, through decreasing the need for physical transportation; one such method being telecommuting. Few predictions have been reported regarding its aggregate demand at the level of a city, generally because of the complexity and multi-dimensionality...

Big-ETL: Extracting-Transforming-Loading Approach for Big Data

The ETL (Extracting-Transforming-Loading) process is responsible for (E)xtracting data from heterogeneous sources, (T)ransforming them, and finally (L)oading them into a data warehouse (DW). Nowadays, the Internet and Web 2.0 generate data at an increasing rate and therefore confront information systems (IS) with the challenge of big data. Data integration systems and ETL, in particular, should be r...

METL: Managing and Integrating ETL Processes

Companies use Extract-Transform-Load (ETL) tools to save time and costs when developing and maintaining data migration tasks. ETL tools allow the definition of often complex processes to extract, transform, and load heterogeneous data into a data warehouse or to perform other data migration tasks. In larger organizations many ETL processes of different data integration and warehouse projects ac...

Journal:
  • PVLDB

Volume 8  Issue 

Pages  -

Publication date: 2015